In [1]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
Mounted at /content/drive
In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Q1.A Read ‘Car name.csv’ as a DataFrame and assign it to a variable

In [3]:
df1 = pd.read_csv('/content/drive/MyDrive/Car_name.csv')
In [4]:
df1.head()
Out[4]:
car_name
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino

Q1.B Read ‘Car-Attributes.json’ as a DataFrame and assign it to a variable.

In [5]:
df2 = pd.read_json('/content/drive/MyDrive/Car-Attributes.json')
In [6]:
df2.head()
Out[6]:
mpg cyl disp hp wt acc yr origin
0 18.0 8 307.0 130 3504 12.0 70 1
1 15.0 8 350.0 165 3693 11.5 70 1
2 18.0 8 318.0 150 3436 11.0 70 1
3 16.0 8 304.0 150 3433 12.0 70 1
4 17.0 8 302.0 140 3449 10.5 70 1

Q1.C Merge both the DataFrames together to form a single DataFrame.

In [7]:
print("The shape of the car name DataFrame df1:", df1.shape)
print("The shape of the car attributes DataFrame df2:", df2.shape)
The shape of the car name DataFrame df1: (398, 1)
The shape of the car attributes DataFrame df2: (398, 8)
In [8]:
# Both DataFrames have the same number of rows, so they can be concatenated column-wise directly.
frames = [df1, df2]
df = pd.concat(frames, axis=1)
df.head()
Out[8]:
car_name mpg cyl disp hp wt acc yr origin
0 chevrolet chevelle malibu 18.0 8 307.0 130 3504 12.0 70 1
1 buick skylark 320 15.0 8 350.0 165 3693 11.5 70 1
2 plymouth satellite 18.0 8 318.0 150 3436 11.0 70 1
3 amc rebel sst 16.0 8 304.0 150 3433 12.0 70 1
4 ford torino 17.0 8 302.0 140 3449 10.5 70 1
In [9]:
print("The shape of the new combined dataset:", df.shape)
The shape of the new combined dataset: (398, 9)
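Because the two files share no key column, the concat above relies purely on positional index alignment. A minimal sketch (with hypothetical stand-in frames) of the sanity checks worth running before such a concat:

```python
import pandas as pd

# Hypothetical small frames standing in for df1 and df2
names = pd.DataFrame({'car_name': ['chevrolet chevelle malibu', 'buick skylark 320']})
attrs = pd.DataFrame({'mpg': [18.0, 15.0], 'cyl': [8, 8]})

# Row counts and indexes must line up for a positional (axis=1) concat to be safe
assert len(names) == len(attrs)
assert names.index.equals(attrs.index)

combined = pd.concat([names, attrs], axis=1)
print(combined.shape)  # (2, 3)
```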

Q1.D Print 5 point summary of the numerical features and share insights.

In [10]:
df.describe().T
Out[10]:
count mean std min 25% 50% 75% max
mpg 398.0 23.514573 7.815984 9.0 17.500 23.0 29.000 46.6
cyl 398.0 5.454774 1.701004 3.0 4.000 4.0 8.000 8.0
disp 398.0 193.425879 104.269838 68.0 104.250 148.5 262.000 455.0
wt 398.0 2970.424623 846.841774 1613.0 2223.750 2803.5 3608.000 5140.0
acc 398.0 15.568090 2.757689 8.0 13.825 15.5 17.175 24.8
yr 398.0 76.010050 3.697627 70.0 73.000 76.0 79.000 82.0
origin 398.0 1.572864 0.802055 1.0 1.000 1.0 2.000 3.0

The 'hp' column is missing from the summary above, which suggests it is not numeric. Check it for placeholder values that need imputation.

In [11]:
imputeHP = pd.DataFrame(df.hp.str.isdigit())
df[imputeHP['hp'] == False]
Out[11]:
car_name mpg cyl disp hp wt acc yr origin
32 ford pinto 25.0 4 98.0 ? 2046 19.0 71 1
126 ford maverick 21.0 6 200.0 ? 2875 17.0 74 1
330 renault lecar deluxe 40.9 4 85.0 ? 1835 17.3 80 2
336 ford mustang cobra 23.6 4 140.0 ? 2905 14.3 80 1
354 renault 18i 34.5 4 100.0 ? 2320 15.8 81 2
374 amc concord dl 23.0 4 151.0 ? 3035 20.5 82 1

Only 6 rows lack a valid 'hp' value, so these rows are dropped.
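Dropping is reasonable here since only six rows are affected. An alternative, shown as a sketch on a hypothetical series, is to coerce the '?' placeholders to NaN and impute the median instead:

```python
import pandas as pd

# Hypothetical 'hp' column containing the same '?' placeholder as the dataset
hp = pd.Series(['130', '165', '?', '150', '?'])

# errors='coerce' turns non-numeric entries into NaN in one step
hp_numeric = pd.to_numeric(hp, errors='coerce')

# Impute missing entries with the median of the valid values
hp_filled = hp_numeric.fillna(hp_numeric.median())
print(hp_filled.tolist())  # [130.0, 165.0, 150.0, 150.0, 150.0]
```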

In [12]:
df.drop(df[df['hp']=='?'].index, inplace=True)
df.head()
Out[12]:
car_name mpg cyl disp hp wt acc yr origin
0 chevrolet chevelle malibu 18.0 8 307.0 130 3504 12.0 70 1
1 buick skylark 320 15.0 8 350.0 165 3693 11.5 70 1
2 plymouth satellite 18.0 8 318.0 150 3436 11.0 70 1
3 amc rebel sst 16.0 8 304.0 150 3433 12.0 70 1
4 ford torino 17.0 8 302.0 140 3449 10.5 70 1
In [13]:
# Convert 'hp' from object to a numeric (float) dtype.
df['hp'] = pd.to_numeric(df['hp'])
df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
mpg 392.0 23.445918 7.805007 9.0 17.000 22.75 29.000 46.6
cyl 392.0 5.471939 1.705783 3.0 4.000 4.00 8.000 8.0
disp 392.0 194.411990 104.644004 68.0 105.000 151.00 275.750 455.0
hp 392.0 104.469388 38.491160 46.0 75.000 93.50 126.000 230.0
wt 392.0 2977.584184 849.402560 1613.0 2225.250 2803.50 3614.750 5140.0
acc 392.0 15.541327 2.758864 8.0 13.775 15.50 17.025 24.8
yr 392.0 75.979592 3.683737 70.0 73.000 76.00 79.000 82.0
origin 392.0 1.576531 0.805518 1.0 1.000 1.00 2.000 3.0

2. Data Preparation & Analysis:

2.A Check and print feature-wise percentage of missing values present in the data and impute with the best suitable approach.

In [14]:
df.isna().sum()/len(df)*100
Out[14]:
car_name    0.0
mpg         0.0
cyl         0.0
disp        0.0
hp          0.0
wt          0.0
acc         0.0
yr          0.0
origin      0.0
dtype: float64

The dataset has no missing values

Q2.B Check for duplicate values in the data and impute with the best suitable approach

In [15]:
df.duplicated().sum()
Out[15]:
0

No row is duplicated.

Q2.C Plot a pairplot for all features.

In [16]:
sns.pairplot(df,hue='cyl',palette='tab10')
Out[16]:
<seaborn.axisgrid.PairGrid at 0x7c9ee23d5240>

Q2.D Visualize a scatterplot for ‘wt’ and ‘disp’. Datapoints should be distinguishable by ‘cyl’.

In [17]:
sns.scatterplot(x=df['wt'],y=df['disp'],hue=df['cyl'],palette='tab10')
Out[17]:
<Axes: xlabel='wt', ylabel='disp'>

Q2.E Share insights for Q2.d.

There is a positive correlation between "wt" and "disp". As the number of cylinders increases, both "wt" and "disp" increase. The dataset contains very few 3- and 5-cylinder vehicles.

Q2.F Visualize a scatterplot for ‘wt’ and ’mpg’. Datapoints should be distinguishable by ‘cyl’.

In [18]:
sns.scatterplot(x=df['wt'],y=df['mpg'],hue=df['cyl'],palette='tab10')
Out[18]:
<Axes: xlabel='wt', ylabel='mpg'>

Q2.G Share insights for Q2.f.

  1. There is a negative correlation between "mpg" and "wt".
  2. Vehicles with more cylinders tend to be heavier.
  3. As weight increases (for vehicles with more cylinders), "mpg" decreases.

Q2.H Check for unexpected values in all the features and datapoints with such values

This was covered in the answer to 1.D: six rows had a '?' value in 'hp' and were dropped.

In [19]:
numeric_columns = ['mpg', 'cyl', 'disp', 'hp', 'wt', 'acc', 'yr', 'origin']
for i in numeric_columns:
  Q1 = df[i].quantile(0.25)
  Q3 = df[i].quantile(0.75)
  IQR = Q3 - Q1
  lower = Q1 - 1.5*IQR
  upper = Q3 + 1.5*IQR
  outlier_count = ((df[i] < lower) | (df[i] > upper)).sum()
  print("The outlier count of", i, "is", outlier_count)
The outlier count of mpg is 0
The outlier count of cyl is 0
The outlier count of disp is 0
The outlier count of hp is 10
The outlier count of wt is 0
The outlier count of acc is 11
The outlier count of yr is 0
The outlier count of origin is 0

The 'hp' and 'acc' columns have outliers.
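The loop above only counts outliers. A common treatment, sketched below on a hypothetical series (not applied in this notebook), is to cap values at the IQR fences rather than drop rows:

```python
import pandas as pd

# Hypothetical series with one obvious high outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values at the fences instead of removing the rows
capped = s.clip(lower=lower, upper=upper)
print(capped.max() <= upper)  # True
```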

In [20]:
df.isna().sum()
Out[20]:
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64

Clustering

In [21]:
from sklearn.cluster import KMeans
from scipy.stats import zscore
In [22]:
#drop the car name column
df_1= df.drop(['yr','origin','car_name'],axis=1)
#Scale the data
df_scaled = df_1.apply(zscore)
df_scaled.sample(5)
Out[22]:
mpg cyl disp hp wt acc
25 -1.724931 1.483947 1.584416 2.875254 1.930190 -0.559396
139 -1.211785 1.483947 1.029447 0.924265 1.957303 0.166467
168 -0.057205 -0.864014 -0.520637 -0.558487 -0.399124 0.529398
395 1.097374 -0.864014 -0.568479 -0.532474 -0.804632 -1.430430
69 -1.468358 1.483947 1.488732 1.444529 1.742760 -0.740861
Q3.A Apply K-Means clustering for 2 to 10 clusters.

Q3.B Plot a visual and find the elbow point.

Q3.C On the above visual, highlight the possible elbow points.
In [23]:
# As per the question, the loop runs over range(2, 11) to apply K-Means clustering for 2 to 10 clusters.
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

clusters = []

for i in range(2, 11):
    kmodel = KMeans(n_clusters=i,random_state=1).fit(df_scaled)
    clusters.append(kmodel.inertia_)
    print("The K means model inertia for cluster",i,"is",kmodel.inertia_)

fig, ax = plt.subplots(figsize=(4, 4))
sns.lineplot(x=list(range(2, 11)), y=clusters, ax=ax,marker='o')
ax.set_title('Find the elbow point')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
The K means model inertia for cluster 2 is 927.6954635551294
The K means model inertia for cluster 3 is 596.5247365547316
The K means model inertia for cluster 4 is 482.10788335892926
The K means model inertia for cluster 5 is 416.4942101850577
The K means model inertia for cluster 6 is 360.2140165537845
The K means model inertia for cluster 7 is 327.2083846469888
The K means model inertia for cluster 8 is 295.9354327620717
The K means model inertia for cluster 9 is 278.73123357763905
The K means model inertia for cluster 10 is 264.2567939129071
Out[23]:
Text(0, 0.5, 'Inertia')
In [24]:
# For better understanding, run K-Means clustering again for 1 to 10 clusters.
clusters = []

for i in range(1, 11):
    kmodel = KMeans(n_clusters=i,random_state=1).fit(df_scaled)
    clusters.append(kmodel.inertia_)
    print("The K means model inertia for cluster",i,"is",kmodel.inertia_)

fig, ax = plt.subplots(figsize=(4, 4))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax,marker='o')
ax.set_title('Find the elbow point')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
arrowprops = dict(
    arrowstyle = "->",
    connectionstyle = "angle, angleA =-180, angleB = 45,\
    rad = 10")
offset = 50
ax.annotate("possible points",
            (3, 596), xytext =(offset, offset),
            textcoords ='offset points',arrowprops = arrowprops)
ax.annotate("possible points",
            (5, 416), xytext =( offset, offset),
            textcoords ='offset points',arrowprops = arrowprops)
ax.annotate("possible points",
            (6, 360), xytext =( offset, offset),
            textcoords ='offset points',arrowprops = arrowprops)
The K means model inertia for cluster 1 is 2352.000000000001
The K means model inertia for cluster 2 is 927.6954635551294
The K means model inertia for cluster 3 is 596.5247365547316
The K means model inertia for cluster 4 is 482.10788335892926
The K means model inertia for cluster 5 is 416.4942101850577
The K means model inertia for cluster 6 is 360.2140165537845
The K means model inertia for cluster 7 is 327.2083846469888
The K means model inertia for cluster 8 is 295.9354327620717
The K means model inertia for cluster 9 is 278.73123357763905
The K means model inertia for cluster 10 is 264.2567939129071
Out[24]:
Text(50, 50, 'possible points')
In [25]:
from sklearn.metrics import silhouette_samples, silhouette_score
df_scaled.sample(5)
Out[25]:
mpg cyl disp hp wt acc
371 0.712514 -0.864014 -0.568479 -0.532474 -0.533507 0.166467
138 -1.211785 1.483947 1.182542 1.184397 1.743939 -0.740861
312 1.764465 -0.864014 -1.037332 -1.026725 -1.129982 0.311639
210 -0.570352 0.309967 -0.367542 0.091842 -0.056092 -0.014999
215 -1.340071 1.483947 1.182542 1.184397 0.916420 -0.559396

The possible elbow points are 3, 5, and 6.

Q3.D Train a K-means clustering model once again on the optimal number of clusters.

Q3.E Add a new feature in the DataFrame which will have labels based upon cluster value.

Q3.F Plot a visual and color the datapoints based upon clusters.

In [26]:
# Compute the silhouette score for each candidate cluster count.
candidate_k = [2, 3, 4, 5, 6, 7, 8]
for f in candidate_k:
  km = KMeans(n_clusters=f, random_state=1).fit(df_scaled)
  labels = km.labels_
  print("Silhouette score for k =", f, "is", silhouette_score(df_scaled, labels))
Silhouette score for k = 2 is 0.5450184683536872
Silhouette score for k = 3 is 0.44234710113179243
Silhouette score for k = 4 is 0.3816218513467549
Silhouette score for k = 5 is 0.36979033611463874
Silhouette score for k = 6 is 0.33150999377045476
Silhouette score for k = 7 is 0.3041083061385893
Silhouette score for k = 8 is 0.2957075837064921
In [27]:
import matplotlib.pyplot as plt
import seaborn as sns

width = 8
height = 4
s_score=[]
sns.set(rc = {'figure.figsize':(width,height)})
elbow_points=[3,5,6]

for i in elbow_points:
  km= KMeans(n_clusters=i,random_state=10).fit(df_scaled)
  z='Labels_'+str(i)
  df_scaled[z] = km.labels_
  fig, axes = plt.subplots(1, 2)
  fig.subplots_adjust(hspace=0.125, wspace=.5)
  sns.scatterplot(x=df_scaled['wt'], y=df_scaled['mpg'], hue=df_scaled[z], ax = axes[0])
  sns.scatterplot(x=df_scaled['wt'], y=df_scaled['hp'], hue=df_scaled[z], ax = axes[1])
In [28]:
df_scaled.head()
Out[28]:
mpg cyl disp hp wt acc Labels_3 Labels_5 Labels_6
0 -0.698638 1.483947 1.077290 0.664133 0.620540 -1.285258 1 1 5
1 -1.083498 1.483947 1.488732 1.574594 0.843334 -1.466724 1 1 5
2 -0.698638 1.483947 1.182542 1.184397 0.540382 -1.648189 1 1 5
3 -0.955212 1.483947 1.048584 1.184397 0.536845 -1.285258 1 1 5
4 -0.826925 1.483947 1.029447 0.924265 0.555706 -1.829655 1 1 5
In [29]:
df_1.tail(5)
Out[29]:
mpg cyl disp hp wt acc
393 27.0 4 140.0 86 2790 15.6
394 44.0 4 97.0 52 2130 24.6
395 32.0 4 135.0 84 2295 11.6
396 28.0 4 120.0 79 2625 18.6
397 31.0 4 119.0 82 2720 19.4
In [30]:
df_scaled.tail()
Out[30]:
mpg cyl disp hp wt acc Labels_3 Labels_5 Labels_6
393 0.455941 -0.864014 -0.520637 -0.480448 -0.221125 0.021294 2 4 2
394 2.636813 -0.864014 -0.932079 -1.364896 -0.999134 3.287676 2 3 4
395 1.097374 -0.864014 -0.568479 -0.532474 -0.804632 -1.430430 2 0 3
396 0.584228 -0.864014 -0.712005 -0.662540 -0.415627 1.110088 2 3 4
397 0.969088 -0.864014 -0.721574 -0.584501 -0.303641 1.400433 2 3 4

3G. Pass a new DataPoint and predict which cluster it belongs to.

In [31]:
km_3 = KMeans(n_clusters=3, random_state=10).fit(df_1)
cluster=km_3.predict([[28,4,120,79,2625,18.6]])[0]
cluster
Out[31]:
2
In [32]:
km_6 = KMeans(n_clusters=6, random_state=5).fit(df_1)
cluster=km_6.predict([[28,4,120,79,2625,18.6]])[0]
cluster
Out[32]:
4
In [33]:
km_5 = KMeans(n_clusters=5).fit(df_1)
cluster=km_5.predict([[30,6,123,120,3000,18.6]])[0]
cluster
Out[33]:
0
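Note that the prediction models above are refit on the unscaled df_1, whereas the elbow and silhouette analysis used z-scored data. A Pipeline keeps the scaling and clustering consistent for new points; the sketch below uses synthetic stand-in data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the six numeric car features
X = rng.normal(size=(100, 6)) * [5, 1, 100, 40, 800, 3] + [23, 5, 190, 104, 2970, 15]

# The scaler learned on the training data is reused automatically at predict time
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=1))
pipe.fit(X)

new_point = [[28, 4, 120, 79, 2625, 18.6]]
label = pipe.predict(new_point)[0]
print(label in {0, 1, 2})  # True
```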

PART- B

In [34]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from sklearn import metrics
from sklearn.metrics import classification_report

1. Data Understanding & Cleaning:

A. Read ‘vehicle.csv’ and save as DataFrame.

In [35]:
df_vehicle = pd.read_csv('/content/drive/MyDrive/vehicle.csv')
df_vehicle.head()
Out[35]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [36]:
df_vehicle.shape
Out[36]:
(846, 19)

1.B Check percentage of missing values and impute with correct approach.

In [37]:
df_vehicle.isna().sum()/len(df_vehicle)*100
Out[37]:
compactness                    0.000000
circularity                    0.591017
distance_circularity           0.472813
radius_ratio                   0.709220
pr.axis_aspect_ratio           0.236407
max.length_aspect_ratio        0.000000
scatter_ratio                  0.118203
elongatedness                  0.118203
pr.axis_rectangularity         0.354610
max.length_rectangularity      0.000000
scaled_variance                0.354610
scaled_variance.1              0.236407
scaled_radius_of_gyration      0.236407
scaled_radius_of_gyration.1    0.472813
skewness_about                 0.709220
skewness_about.1               0.118203
skewness_about.2               0.118203
hollows_ratio                  0.000000
class                          0.000000
dtype: float64
In [38]:
df_vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [39]:
df_vehicle = df_vehicle.replace(' ', np.nan)
for i in df_vehicle.columns[:18]:
    var = df_vehicle[i].median()
    df_vehicle[i] = df_vehicle[i].fillna(var)
In [40]:
df_vehicle.isna().sum()/len(df_vehicle)*100
Out[40]:
compactness                    0.0
circularity                    0.0
distance_circularity           0.0
radius_ratio                   0.0
pr.axis_aspect_ratio           0.0
max.length_aspect_ratio        0.0
scatter_ratio                  0.0
elongatedness                  0.0
pr.axis_rectangularity         0.0
max.length_rectangularity      0.0
scaled_variance                0.0
scaled_variance.1              0.0
scaled_radius_of_gyration      0.0
scaled_radius_of_gyration.1    0.0
skewness_about                 0.0
skewness_about.1               0.0
skewness_about.2               0.0
hollows_ratio                  0.0
class                          0.0
dtype: float64

All the missing values are handled.
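The imputation loop above is equivalent to scikit-learn's SimpleImputer, which also remembers the learned medians so they can be reapplied to new data. A sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with missing values
data = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# strategy='median' fills each column with its own median
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(filled.isna().sum().sum())  # 0
```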

1.C Visualize a Pie-chart and print percentage of values for variable ‘class’.

In [41]:
df_vehicle['class'].value_counts()
Out[41]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [42]:
df_vehicle['class'].value_counts().plot(kind='pie',autopct='%1.1f%%')
Out[42]:
<Axes: ylabel='class'>
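The question also asks to print the percentages; value_counts(normalize=True) gives them directly (sketched here with the class counts from this dataset):

```python
import pandas as pd

# Class counts taken from the value_counts() output above
cls = pd.Series(['car'] * 429 + ['bus'] * 218 + ['van'] * 199)

# normalize=True returns fractions; multiply by 100 for percentages
pct = cls.value_counts(normalize=True) * 100
print(pct.round(1).to_dict())  # {'car': 50.7, 'bus': 25.8, 'van': 23.5}
```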

1.D Check for duplicate rows in the data and impute with correct approach

In [43]:
df_vehicle.duplicated().sum()
Out[43]:
0

2. Data Preparation:

A. Split data into X and Y. [Train and Test optional]

In [44]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(df_vehicle['class'])
df_vehicle['class'] = le.transform(df_vehicle['class'])
df_vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  846 non-null    float64
 2   distance_circularity         846 non-null    float64
 3   radius_ratio                 846 non-null    float64
 4   pr.axis_aspect_ratio         846 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                846 non-null    float64
 7   elongatedness                846 non-null    float64
 8   pr.axis_rectangularity       846 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              846 non-null    float64
 11  scaled_variance.1            846 non-null    float64
 12  scaled_radius_of_gyration    846 non-null    float64
 13  scaled_radius_of_gyration.1  846 non-null    float64
 14  skewness_about               846 non-null    float64
 15  skewness_about.1             846 non-null    float64
 16  skewness_about.2             846 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    int64  
dtypes: float64(14), int64(5)
memory usage: 125.7 KB
In [45]:
df_vehicle.head()
Out[45]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 2
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 2
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 1
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 2
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 0
In [46]:
# Features (independent variables)
X = df_vehicle.drop('class', axis=1)
# Target variable
y = df_vehicle['class']
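LabelEncoder assigns integer codes in alphabetical order of the class labels, so it is worth checking the mapping explicitly (a sketch on a small sample of the 'class' column):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'class' column
le = LabelEncoder().fit(['van', 'van', 'car', 'van', 'bus'])

# classes_ lists the labels in the order of their integer codes (alphabetical)
mapping = {c: int(i) for c, i in zip(le.classes_, le.transform(le.classes_))}
print(mapping)  # {'bus': 0, 'car': 1, 'van': 2}
```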

B. Standardize the Data.

In [47]:
# Scaling the independent attributes using zscore
X_scaled=X.apply(zscore)

3. Model Building:

In [48]:
xtrain, xtest, ytrain, ytest = train_test_split(X,y, test_size = 0.3, random_state = 10)

3.A Train a base Classification model using SVM.

3.B Print Classification metrics for train data.

In [49]:
def performance_analysis(a,b):
    q=[]
    q.append(accuracy_score(a, b))
    q.append(precision_score(a, b,average="macro"))
    q.append(recall_score(a, b,average="macro"))
    q.append(f1_score(a, b,average="macro"))
    q.append(multiclass_roc_auc_score(a, b, average="macro"))
    return q

def test_train_analysis(ytrain,ytest,predict_train,predict_test):
    train=performance_analysis(ytrain,predict_train)
    test=performance_analysis(ytest,predict_test)
    data= { 'train' : train,
            'test' :test
       }
    # Label order matches the append order in performance_analysis
    index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'roc_auc_score']
    df2 = pd.DataFrame(data, index)
    df2.reset_index(inplace = True)
    display(df2)


def conf_metrix(y,pred):
    cm = metrics.confusion_matrix(y, pred, labels=[0, 1, 2])
    # LabelEncoder codes classes alphabetically: bus=0, car=1, van=2
    df_cm = pd.DataFrame(cm, index=["Bus", "Car", "Van"],
                  columns=["Bus", "Car", "Van"])
    plt.figure(figsize = (7,5))
    sns.heatmap(df_cm, annot=True,fmt= 'd')
    plt.show()

def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)

def complete_analysis(X_train,X_test,y_train,y_test,ML):
  pred_train = ML.predict(X_train)
  pred_test = ML.predict(X_test)
  test_train_analysis(y_train,y_test,pred_train,pred_test)
  conf_metrix(y_test,pred_test )
  print("Classification report on training data=================================")
  print(classification_report(y_train,pred_train ))
  print("Classification report on test data=================================")
  print(classification_report(y_test,pred_test ))

def data_analysis(y,predicted_x):
    result=performance_analysis(y,predicted_x)
    data= { 'performance' : result,
                  }
    # Label order matches the append order in performance_analysis
    index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'roc_auc_score']
    df2 = pd.DataFrame(data, index)
    df2.reset_index(inplace = True)
    display(df2)
In [50]:
svc = SVC(random_state=10)
svc=svc.fit(xtrain, ytrain)
complete_analysis(xtrain,xtest,ytrain,ytest,svc)
index train test
0 Accuracy 0.685811 0.649606
1 Precision 0.655186 0.619053
2 Recall 0.656347 0.617826
3 F1-score 0.649558 0.614685
4 roc_auc_score 0.746460 0.716827
Classification report on training data=================================
              precision    recall  f1-score   support

           0       0.62      0.48      0.54       147
           1       0.79      0.77      0.78       304
           2       0.55      0.72      0.63       141

    accuracy                           0.69       592
   macro avg       0.66      0.66      0.65       592
weighted avg       0.69      0.69      0.68       592

Classification report on test data=================================
              precision    recall  f1-score   support

           0       0.59      0.46      0.52        71
           1       0.74      0.77      0.75       125
           2       0.53      0.62      0.57        58

    accuracy                           0.65       254
   macro avg       0.62      0.62      0.61       254
weighted avg       0.65      0.65      0.65       254

3.C Apply PCA on the data with 10 components.

3.D Visualize Cumulative Variance Explained with Number of Components.

3.E Draw a horizontal line on the above plot to highlight the threshold of 90%.

In [51]:
from sklearn.decomposition import PCA
pca=PCA(n_components=10,random_state=10)
pca_model=pca.fit_transform(X_scaled)
In [52]:
# Calculate the fraction of variance explained by each component.
# Note: this normalizes over the 10 retained components; pca.explained_variance_ratio_
# would normalize over the total variance of all 18 features.
var_explained_per = pca.explained_variance_/np.sum(pca.explained_variance_)
print("Variance explained per component =", var_explained_per)
# Cumulative sum
cum_var_explained = np.cumsum(var_explained_per)
print("Cumulative variance explained =", cum_var_explained)
plt.plot(cum_var_explained, marker='*', markerfacecolor='black', markersize=8)
plt.axhline(y=0.9)
plt.xlabel('n_components')
plt.ylabel('Cumulative variance explained')
plt.show()
Variance explained per component = [0.52854121 0.16943943 0.10697862 0.0663128  0.0515503  0.03034773
 0.0201686  0.01247266 0.00902625 0.0051624 ]
Cumulative variance explained = [0.52854121 0.69798064 0.80495926 0.87127206 0.92282236 0.95317009
 0.97333869 0.98581135 0.9948376  1.        ]

3.F Apply PCA on the data. This time select the minimum components with 90% or above variance explained.

As per the graph above, n_components=4 satisfies the minimum-components-with-90%-or-above-variance condition.
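Alternatively, scikit-learn's PCA accepts a float n_components, which automatically keeps the minimum number of components reaching that fraction of explained variance. A sketch on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data: a few strong directions plus near-duplicate noisy copies
base = rng.normal(size=(200, 4))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 4))])

# n_components=0.90 keeps the fewest components explaining >= 90% of the variance
pca = PCA(n_components=0.90).fit(X)
print(pca.explained_variance_ratio_.sum() >= 0.90)  # True
```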

In [53]:
pca=PCA(n_components=4, random_state=10)
pca_model_4=pca.fit_transform(X_scaled)

3.G Train SVM model on components selected from above step

3.H Print Classification metrics for train data of above model and share insights

In [54]:
svc_pca3 = SVC()
svc_pca3=svc_pca3.fit(pca_model_4,y)
svc_pca_model = svc_pca3.predict(pca_model_4)
print(classification_report(y,svc_pca_model))
data_analysis(y,svc_pca_model)
conf_metrix(y,svc_pca_model)
              precision    recall  f1-score   support

           0       0.85      0.66      0.74       218
           1       0.83      0.90      0.86       429
           2       0.67      0.73      0.70       199

    accuracy                           0.80       846
   macro avg       0.79      0.76      0.77       846
weighted avg       0.80      0.80      0.79       846

index performance
0 Accuracy 0.796690
1 Precision 0.786608
2 Recall 0.762210
3 F1-score 0.769622
4 roc_auc_score 0.825663

4. Performance Improvement:

A. Train another SVM on the components out of PCA. Tune the parameters to improve performance.

B. Share the best parameters observed from the above step.

C. Print classification metrics for train data of the above model and share the relative improvement in performance across all the models, along with insights.

Grid Search SVM on PCA model

In [55]:
svc_grid = SVC(random_state=10)

param_grid = {'C': [0.1, 1, 10],
             'gamma': [1, 0.1, 0.01],
              'kernel': ['sigmoid','rbf','linear','poly']}

grid_svc = GridSearchCV(svc_grid, param_grid)
grid_svc_pca=grid_svc.fit(pca_model_4, y)
print("Params",grid_svc_pca.get_params)
print("Best Params",grid_svc_pca.best_params_)
svc_pca=grid_svc_pca.predict(pca_model_4)
print(classification_report(y,svc_pca))
data_analysis(y,svc_pca)
conf_metrix(y,svc_pca)
Params <bound method BaseEstimator.get_params of GridSearchCV(estimator=SVC(random_state=10),
             param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01],
                         'kernel': ['sigmoid', 'rbf', 'linear', 'poly']})>
Best Params {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
              precision    recall  f1-score   support

           0       0.88      0.81      0.84       218
           1       0.88      0.92      0.90       429
           2       0.78      0.77      0.77       199

    accuracy                           0.86       846
   macro avg       0.85      0.83      0.84       846
weighted avg       0.86      0.86      0.86       846

index performance
0 Accuracy 0.856974
1 Precision 0.845519
2 Recall 0.833839
3 F1-score 0.839137
4 roc_auc_score 0.878167
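The metrics above are computed on the same data the model was fitted on, so they are optimistic. A cross-validated estimate is less biased; the sketch below uses a synthetic stand-in dataset rather than the vehicle data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the PCA-reduced vehicle data
X, y = make_classification(n_samples=300, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=10)

# 5-fold cross-validation with the best parameters found by the grid search
scores = cross_val_score(SVC(C=10, gamma=0.1, kernel='rbf'), X, y, cv=5)
print(len(scores) == 5)  # True
```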
In [56]:
def summary_table(models, xtrain, xtest, ytrain, ytest):
  # Label order matches the append order in performance_analysis
  df_S = pd.DataFrame(index=['Accuracy', 'Precision', 'Recall', 'F1-score', 'roc_auc_score'])
  for i in models:
    train_res = performance_analysis(ytrain, i.predict(xtrain))
    df_S[str(i)[0:3] + "_Train"] = train_res
    test_res = performance_analysis(ytest, i.predict(xtest))
    df_S[str(i)[0:3] + "_Test"] = test_res
  return df_S

def summary_table2(models, x, y):
  df_S = pd.DataFrame(index=['Accuracy', 'Precision', 'Recall', 'F1-score', 'roc_auc_score'])
  for i in models:
    # Use a separate result variable so the target y is not overwritten
    res = performance_analysis(y, i.predict(x))
    df_S[str(i)[0:9] + "result"] = res
  return df_S
In [57]:
models=[svc]
tab1=summary_table(models,xtrain,xtest,ytrain,ytest)
models=[svc_pca3]
tab2=summary_table2(models,pca_model_4,y)
models=[grid_svc_pca]
tab3=summary_table2(models,pca_model_4,y)
tab4=tab2.join(tab3)
tab4.rename(columns={'SVC()result':'SVC_PCA','GridSearcresult':'SVC_PCA_Grid'}, inplace = True)
display(tab1.join(tab4))
SVC_Train SVC_Test SVC_PCA SVC_PCA_Grid
Accuracy 0.685811 0.649606 0.796690 0.856974
Precision 0.655186 0.619053 0.786608 0.845519
Recall 0.656347 0.617826 0.762210 0.833839
F1-score 0.649558 0.614685 0.769622 0.839137
roc_auc_score 0.746460 0.716827 0.825663 0.878167

Performance improved significantly for the SVC trained on PCA components, and further with GridSearch tuning of the PCA-based SVC.

5. Data Understanding & Cleaning

In [58]:
sns.pairplot(pd.DataFrame(X))
Out[58]:
<seaborn.axisgrid.PairGrid at 0x7c9ed17bb130>
In [59]:
sns.pairplot(pd.DataFrame(pca_model_4))
Out[59]:
<seaborn.axisgrid.PairGrid at 0x7c9ed17b91e0>

A. Explain pre-requisite/assumptions of PCA.

  1. The dataset has 18 features, which is a large number to model and predict with.
  2. Working with all 18 features is computationally expensive, so reducing the feature count is desirable.
  3. The features exhibit multicollinearity, which PCA can exploit by compressing correlated features into fewer components.

B. Explain advantages and limitations of PCA

Advantages-

  • It helps prevent overfitting.
  • It removes high correlation among features.
  • It improves visualization.

Limitations-

  • Scaling the data is an essential step before PCA.
  • Feature reduction can cause some loss of information.
  • Transforming the original features can be risky, as the resulting components are harder to interpret.